The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Embedding tables are usually huge in click-through rate (CTR) prediction models. To train and deploy the CTR models efficiently and economically, it is necessary to compress their embedding tables at the training stage. To this end, we formulate a novel quantization training paradigm to compress the embeddings from the training stage, termed low-precision training (LPT). Also, we provide theoretical analysis on its convergence. The results show that stochastic weight quantization has a faster convergence rate and a smaller convergence error than deterministic weight quantization in LPT. Further, to reduce the accuracy degradation, we propose adaptive low-precision training (ALPT) that learns the step size (i.e., the quantization resolution) through gradient descent. Experiments on two real-world datasets confirm our analysis and show that ALPT can significantly improve the prediction accuracy, especially at extremely low bit widths. For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy. The code of ALPT is publicly available.
translated by 谷歌翻译
完全有监督的语义细分从密集的口罩中学习,这需要封闭设置的大量注释成本。在本文中,我们使用自然语言作为监督,而无需任何像素级注释进行开放世界细分。我们将提出的框架称为FreeSeg,在该框架上可以从训练训练型模型的原始功能图中免费获得。与零射击或开放集分割相比,freeSeg不需要任何带注释的掩码,并且可以广泛预测超出类无需监督的分段之外的类别。具体而言,FreeSeg从图像文本相似性图(ITSM)中获得了可解释的对比度图像预处理(ICLIP)的自由掩码。我们的核心改进是浓密ICLIP的平滑最小池,具有部分标签和像素的分割策略。此外,没有复杂的设计,例如分组,聚类或检索,很简单。除了简单性外,Freeseg的表现超过了以前的最先进的边缘,例如在同一设置中,MIOU在MIOU上的13.4%。
translated by 谷歌翻译
目前,下一个位置推荐在基于位置的社交网络应用程序和服务中起着重要作用。虽然已经提出了许多方法来解决这个问题,但到目前为止,三个重要挑战尚未得到很好的解决:(1)大多数现有方法基于经常性网络,这是耗费训练长期序列,因为不允许完整的平行度; (2)个性化偏好通常不被认为是合理的; (3)现有方法很少系统地研究了如何在轨迹数据中有效地利用各种辅助信息(例如,用户ID和时间戳)和非连续位置之间的时空关系。为了解决上述挑战,我们提出了一种名为SANMOVE的新型方法,是一种自我关注网络的模型,通过捕获用户的长期和短期移动模式来预测下一个位置。具体而言,SANMOVE引入了一个长期偏好学习模块,它使用自我关注模块来捕获用户的长期移动模式,可以代表用户的个性化位置偏好。同时,SanMove使用空间延伸的非侵入自我关注(Stnova)来利用辅助信息来学习短期偏好。我们使用两个真实世界数据集进行评估SANMOVE,并演示SANMOVE不仅比基于最先进的RNN的预测模型更快,而且还优于下一个位置预测的基线。
translated by 谷歌翻译
最近,卷积增强的变压器(构象异构体)在自动语音识别(ASR)中显示出令人鼓舞的结果,表现优于先前发表的最佳变压器传感器。在这项工作中,我们认为编码器和解码器中每个块的输出信息并不完全包容,换句话说,它们的输出信息可能是互补的。我们研究如何以参数效率的方式利用每个块的互补信息,并且可以预期这可能会导致更强的性能。因此,我们提出了刻板的变压器以进行语音识别,名为BlockFormer。我们已经实现了两个块集合方法:块输出的基本加权总和(基本WSBO),以及挤压和激气模块到块输出的加权总和(SE-WSBO)。实验已经证明,阻滞剂在Aishell-1上大大优于基于最新的构象模型,我们的模型在不使用语言模型的情况下达到了4.35 \%的CER,并且在4.10 \%上具有外部语言模型的4.10 \%测试集。
translated by 谷歌翻译
点击率预测是商业推荐系统中的核心任务之一。它旨在预测用户点击给定用户和项目特征的特定项目的概率。随着特征相互作用引入非线性,它们被广泛采用以提高CTR预测模型的性能。因此,有效的建模特征互动在研究和工业领域引起了很多关注。目前的方法通常可以分为三类:(1)NA \“IVE方法,它不会模拟特征交互,只使用原始特征;(2)记忆方法,通过显式将其视为新功能而记住功能交互。分配可培训嵌入式;(3)分解方法,学习原始特征的潜在矢量和通过分解功能的隐式模型相互作用。研究表明,由于不同特征相互作用的独特特征,这些方法之一的建模特征交互是次优。为了解决这个问题,我们首先提出一个称为OptInter的一般框架,该框架可以找到每个功能交互的最合适的建模方法。可以将不同的最先进的深度CTR模型视为optinter的实例。实现功能Optinter,我们还介绍了一种自动搜索最佳建模方法的学习算法。W e在四个大型数据集中进行广泛的实验。我们的实验表明,Optinter可提高最佳的最先进的基线深度CTR模型,高达2.21%。与回忆的方法相比,这也优于基线,我们减少了高达91%的参数。此外,我们进行了几项消融研究,以研究Optinter不同组分的影响。最后,我们提供关于替代替代品结果的可解释讨论。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译